PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. #8

julienledem · 2014-08-23T00:16:54Z

This pull request provides an API to read thrift in a streaming fashion.
This enables ignoring fields that are not needed without loading them into memory.
It also aloow treating the data as it comes instead of when it's fully loaded in memory.

Author: Julien Le Dem <julien@twitter.com> Closes apache#8 from julienledem/master and squashes the following commits: aa0d751 [Julien Le Dem] correct mailling-list 285b39c [Julien Le Dem] Update CONTRIBUTING.md bca9bc4 [Julien Le Dem] Create CONTRIBUTING.md

tsdeng · 2014-08-25T18:44:17Z

src/main/java/parquet/format/event/Consumers.java

Could you add a comment: The delegatingFieldConsumer delegates addField call to corresponding TypedConsumer?

Conflicts: src/main/java/parquet/format/Util.java

julienledem · 2014-08-28T17:43:12Z

@tsdeng I refactored the APIs a bit, added javadoc and resolved conflicts.

dvryaboy · 2014-08-28T21:40:59Z

I don't want to block this PR, but I do want to comment on the approach...

While this will likely address the immediate problem (memory pressure when creating splits), having this giant thrift object in a footer is still an issue. For one thing, we have to write it -- the committer that writes the joined footer is probably also under memory pressure since it has to have the merged thrift object in memory. Also, this makes reading just one file from a directory that has 400 unnecessarily expensive (as expensive as initializing a read for all 400 files).

The joined footer doesn't need to be a single object and we should probably move to a model where it's a sequence of objects, one per file, and we can choose which file's metadata we are reading.

julienledem · 2014-08-28T22:07:54Z

Agreed.
I think one approach will be to rework the metadata a bit so that we don't use the column names as identifiers but rather the column indices (the schema is already in there anyway).

The committer does only one partition at a time so it is much more manageable than when reading many partitions at once. (we had issues with a schema with thousands of columns reading dozens of partitions at a time).

One thrift containing a list of Thrift object versus a sequence of objects, doesn't make any difference in my opinion as we can easily handle the top level object without having to materialize the list as in this PR. This could be applied in the committer for write as well so that it would not load everything in memory.

I'm also proposing a new _common_metadata file, so that we don't have to skip the row groups when we don't need them:
https://github.com/apache/incubator-parquet-mr/pull/45/files#diff-e3b91b40045ac64f6dfd6318453af9c2R319

julienledem · 2014-08-28T22:09:04Z

@dvryaboy Also if you think this is a good first step toward fixing this issue, feel free to merge.

dvryaboy · 2014-08-28T22:19:13Z

I'm lost in layers of abstraction... :)

julienledem · 2014-08-28T22:29:24Z

maybe I should throw in a monad or two.

tsdeng · 2014-08-28T22:49:43Z

src/main/java/parquet/format/event/Consumers.java

Seems the Consumers here is just used as a namespace. So should it be a package name?

Missed, it's providing some static util methods

tsdeng · 2014-08-29T01:58:12Z

src/main/java/parquet/format/event/TypedConsumer.java

We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:

abstract class TypedConsumer<T> { abstract public void consume(T value) }

advantage of this is removing the duplication of addXXX declaration for each XXConsumer

also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:

get the value from protocol

call the consume method

So the code change would be:

abstract public static class TypedConsumer<T> { final public void read(TProtocol protocol, EventBasedThriftReader reader) { this.consume(getValue(protocol)) } abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer }

Also I prefer the name to be consume, rather than add. It's just consumer <-> consume reads better than consumer<->add

I renamed a bunch of stuff according to your comment.
for the Generic TypedConsumer they actually don't all have the same signature (see Struct, Set, List, Map) so I don't think we should do it.

tsdeng · 2014-08-29T22:21:28Z

src/main/java/parquet/format/event/TypedConsumer.java

Seems you missed this comment:

We can make TypedConsumer generic
and the add method could be of generic type T and is defined in the super class which is TypedConsumer:

abstract class TypedConsumer<T> { abstract public void consume(T value) }

advantage of this is removing the duplication of addXXX declaration for each XXConsumer

also after adding the Generic, the this.addXXX call can also be lifted to the super class. And each subclass only needs to define the T getValue method. The simplifies the code of TypedConsumer to only focus on 2 things:

get the value from protocol

call the consume method

So the code change would be:

abstract public static class TypedConsumer<T> { final public void read(TProtocol protocol, EventBasedThriftReader reader) { this.consume(getValue(protocol)) } abstract public T getValue(TProtocol p) //This method should be implemented in concreate TypedConsumer }

I did reply to this one:
not all TypedConsumers do consume(T value). See {Set,List,Struct,Map}Consumer so that would not apply in that case.

tsdeng · 2014-08-29T23:47:37Z

LGTM +1

julienledem · 2014-08-30T00:22:24Z

If you like it, please merge it.

dvryaboy · 2014-08-30T00:32:50Z

I'll merge.

dvryaboy · 2014-08-30T01:53:46Z

@julienledem using this to read footers is a separate issue, correct? This doesn't seem to.

julienledem · 2014-08-30T04:14:19Z

correct: https://github.com/apache/incubator-parquet-mr/pull/45

julienledem added 2 commits August 22, 2014 13:57

first stab

71c85de

streaming thrift reader

6368bdc

refactor, cleanup

bee937a

julienledem changed the title ~~add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields.~~ PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. Aug 25, 2014

added simple readMetaData method

9be786a

tsdeng reviewed Aug 25, 2014
View reviewed changes

julienledem added 2 commits August 28, 2014 10:32

javadoc; fix apis

958726f

Merge branch 'master' into streaming_metadata

8dd801e

Conflicts: src/main/java/parquet/format/Util.java

tsdeng reviewed Aug 28, 2014
View reviewed changes

RIP TypedConsumerProvider, @tsdeng did not like you

cb386ce

tsdeng reviewed Aug 29, 2014
View reviewed changes

julienledem added 3 commits August 29, 2014 09:35

#simplify

e5c78fc

rename add to consume

a58913d

cleanup refactoring

621769a

tsdeng reviewed Aug 29, 2014
View reviewed changes

asfgit closed this in addbbb9 Aug 30, 2014

PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. #8

PARQUET-79: add a streaming Thrift API, to enable processing the metadata as we read it and skipping unnecessary fields. #8

Uh oh!

Conversation

julienledem commented Aug 23, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

julienledem commented Aug 28, 2014

Uh oh!

dvryaboy commented Aug 28, 2014

Uh oh!

julienledem commented Aug 28, 2014

Uh oh!

julienledem commented Aug 28, 2014

Uh oh!

dvryaboy commented Aug 28, 2014

Uh oh!

julienledem commented Aug 28, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tsdeng commented Aug 29, 2014

Uh oh!

julienledem commented Aug 30, 2014

Uh oh!

dvryaboy commented Aug 30, 2014

Uh oh!

dvryaboy commented Aug 30, 2014

Uh oh!

julienledem commented Aug 30, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants